Python - pandas Data Integration (concat, join, merge)

Table of Contents

This article explains methods for integrating multiple datasets into one using the pandas library.

There are several methods for integrating data, but this time we will cover concat, join, merge.

Before we explain, let’s create two example data frames.

>>> import pandas as pd 

>>> df1 = pd.DataFrame({
    'Class1' : [95, 92, 98, 100],
    'Class2' : [91, 93, 97, 99]
})

>>> df2 = pd.DataFrame({
    'Class1' : [87, 89],
    'Class2' : [85, 90]
})

Output of d1:

	Class1	Class2
0	87	85
1	89	90

Output of d2:

	Class1	Class2
0	95	91
1	92	93
2	98	97
3	100	99

1. concat #

The concat function of the pandas library is used to append data frames. This function can append multiple data frames in either row or column direction.

Let’s append df1 and df2 to the result.

>>> result = pd.concat([df1, df2])
>>> result

	Class1	Class2
0	95	91.0
1	92	93.0
2	98	97.0
3	100	99.0
4	87	85.0
5	89	90.0
6	96	NaN
7	83	NaN

pd.concat([df1, df2]) appends df1 and df2 in the row direction. That is, the two data frames are connected vertically.

Next, let’s integrate d3, which only has the Class1 column, into the result.

>>> df3 = pd.DataFrame({
    'Class1' : [96, 83]
})

>>> pd.concat([result, df3], ignore_index=True)

	Class1	Class2
0	95	91.0
1	92	93.0
2	98	97.0
3	100	99.0
4	87	85.0
5	89	90.0
6	96	NaN
7	83	NaN

Since the d3 data does not have the ‘Class2’ column, it outputs blank values.

2. join #

The join method of the pandas library is used to combine two data frames based on a specific column. It generally performs a role similar to SQL’s JOIN operation. Unlike concat, join integrates horizontally.

>>> df4 = pd.DataFrame({
    'Class3' : [93, 91, 95, 98]
})

>>> df1.join(df4)

	Class1	Class2	Class3
a	95	91	93
b	92	93	91
c	98	97	95
d	100	99	98

It is also possible to output by arbitrarily setting the index as follows.

>>> index_label = ['a','b','c','d']
>>> df1a = pd.DataFrame({'Class1': [95, 92, 98, 100],
                    'Class2': [91, 93, 97, 99]}, index= index_label)
>>> df4a = pd.DataFrame({'Class3': [93, 91, 95, 98]}, index=index_label)

>>> df1a.join(df4a)

	Class1	Class2	Class3
a	95	91	93
b	92	93	91
c	98	97	95
d	100	99	98

3. merge #

The merge function of the pandas library is used to merge (integrate) two data frames based on a specific column. Using the merge function, you can combine data frames based on common columns between them.

>>> df_A_B = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
                       'Product A': [100, 150, 200, 130],
                       'Product B': [90, 110, 140, 170]})

>>> df_C_D = pd.DataFrame({'Sales Month': ['January', 'February', 'March', 'April'],
                       'Product C': [112, 141, 203, 134],
                       'Product D': [90, 110, 140, 170]})

df_A_B

	Sales Month	Product A	Product B
0	January	100	90
1	February	150	110
2	March	200	140
3	April	130	170

df_C_D

	Sales Month	Product C	Product D
0	January	112	90
1	February	141	110
2	March	203	140
3	April	134	170

Use merge to merge the two data frames based on the ‘Sales Month’ column. As a result, the two data frames are combined based on the ‘Sales Month’ column, and the data is organized around the common columns.

>>> df_A_B.merge(df_C_D)

	Sales Month	Product A	Product B	Product C	Product D
0	January	100	90	112	90
1	February	150	110	141	110
2	March	200	140	203	140
3	April	130	170	134	170

Let’s implement four different ways to combine two data frames using the merge method.

>>> df_left = pd.DataFrame({'key':['A','B','C'], 'left': [1, 2, 3]})
>>> df_right = pd.DataFrame({'key':['A','B','D'], 'right': [4, 5, 6]})

>>> df_left.merge(df_right, how='left', on = 'key')

	key	left	right
0	A	1	4.0
1	B	2	5.0
2	C	3	NaN

Left join df_left and df_right based on the ‘key’ column. A left join keeps all rows from the left data frame (df_left) and adds rows from the right data frame (df_right) that have a matching key value. If a matching key value is not present in the right data frame, it is filled with NaN.

>>> df_left.merge(df_right, how='right', on = 'key')

	key	left	right
0	A	1.0	4
1	B	2.0	5
2	D	NaN	6

Right join df_left and df_right based on the ‘key’ column. A right join keeps all rows from the right data frame (df_right) and adds rows from the left data frame (df_left) that have a matching key value. If a matching key value is not present in the left data frame, it is filled with NaN.

>>> df_left.merge(df_right, how='outer', on = 'key')

	key	left	right
0	A	1.0	4.0
1	B	2.0	5.0
2	D	3.0	NaN
3	D	NaN	6.0

Outer join df_left and df_right based on the ‘key’ column. An outer join includes all rows from both data frames, filling with NaN where a match is only present in one of the data frames.

>>> df_left.merge(df_right, how='inner', on = 'key')

	key	left	right
0	A	1	4
1	B	2	5

Inner join df_left and df_right based on the ‘key’ column. An inner join includes only rows that are common to both data frames. That is, it combines rows from both data frames that have the same ‘key’ value.